DETAILED ACTION 

1. Claims 1-12 have been examined. 

Drawings 

2. The drawings are objected to under 37 CFR 1.83(a). The drawings must show every 
feature of the invention specified in the claims. Therefore, the complete subject matter of claims 
1-12 must be shown or the claims will be cancelled. No new matter should be entered. 

Corrected drawing sheets in compliance with 37 CFR 1.121(d) are required in reply to 
the Office action to avoid abandonment of the application. Any amended replacement drawing 
sheet should include all of the figures appearing on the immediate prior version of the sheet, 
even if only one figure is being amended. The figure or figure number of an amended drawing 
should not be labeled as "amended." If a drawing figure is to be canceled, the appropriate figure 
must be removed from the replacement sheet, and where necessary, the remaining figures must 
be renumbered and appropriate changes made to the brief description of the several views of the 
drawings for consistency. Additional replacement sheets may be necessary to show the 
renumbering of the remaining figures. Each drawing sheet submitted after the filing date of an 
application must be labeled in the top margin as either "Replacement Sheet" or "New Sheet" 
pursuant to 37 CFR 1.121(d). If the changes are not accepted by the examiner, the applicant will 
be notified and informed of any required corrective action in the next Office action. The 
objection to the drawings will not be held in abeyance. 



Specification 

3. The abstract of the disclosure does not commence on a separate sheet in accordance with 
37 CFR 1.52(b)(4). A new abstract of the disclosure is required and must be presented on a 
separate sheet, apart from any other text. 

Claim Objections 

4. Claims 1-12 are objected to for failing to meet the following requirements: all claims 
must begin with a capital letter, end with a period, and must contain only one sentence. 
Appropriate correction is required. 

Claim Rejections - 35 USC § 112 

5. The following is a quotation of the second paragraph of 35 U.S.C. 112: 

The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the 
subject matter which the applicant regards as his invention. 

6. Claims 1-12 are rejected under 35 U.S.C. 112, second paragraph, as being indefinite for 
failing to particularly point out and distinctly claim the subject matter which applicant regards as 
the invention. 

7. Claims 1-2 and 5-12 recite the limitation "and/or", the meaning of which is not clear in 
the context of claim language. For the purpose of examination, the broadest interpretation of the 
phrase, "or", will be used. 

8. Claims 1 and 2 recite the limitation "the symbolic machine" on page 43, line 26 and page 
47, line 3, respectively. There is insufficient antecedent basis for this limitation in the claims. 

9. Claims 1 and 2 recite the limitation "one or more data caches at different memory 
hierarchy levels" in line 5. It is not clear what it means for a single data cache to exist at 
different memory hierarchy levels. Appropriate correction or clarification is required. 

10. Claim 2 recites the limitation "determine a program counter value which is associated 
with said region (R2) determine which part of the information (IF7) to (IF 11) is common both to 
said region (R2) and to a region other than region (R2)" in lines 18-20 of page 49. The meaning 
of this phrase is not clear and leaves the scope of the claim indefinite. 

11. Claim 4 recites the limitation "said region (R1)" in line 6. There is insufficient 
antecedent basis for this limitation in the claim. 



Claim Rejections - 35 USC §103 

12. The following is a quotation of 35 U.S.C. 103(a) which forms the basis for all 
obviousness rejections set forth in this Office action: 

(a) A patent may not be obtained though the invention is not identically disclosed or described as set forth in 
section 102 of this title, if the differences between the subject matter sought to be patented and the prior art are 
such that the subject matter as a whole would have been obvious at the time the invention was made to a person 
having ordinary skill in the art to which said subject matter pertains. Patentability shall not be negatived by the 
manner in which the invention was made. 

13. The factual inquiries set forth in Graham v. John Deere Co., 383 U.S. 1, 148 USPQ 459 
(1966), that are applied for establishing a background for determining obviousness under 35 
U.S.C. 103(a) are summarized as follows: 

1. Determining the scope and contents of the prior art. 

2. Ascertaining the differences between the prior art and the claims at issue. 

3. Resolving the level of ordinary skill in the pertinent art. 

4. Considering objective evidence present in the application indicating obviousness 
or nonobviousness. 

14. Claims 1-4 and 7-10 are rejected under 35 U.S.C. 103(a) as being unpatentable over 
Patterson et al. (Computer Organization and Design: The Hardware/Software Interface), 
referenced from here forward as Patterson, in view of Yeh et al. (Alternative Implementations of 
Two-Level Adaptive Branch Prediction), referenced from here forward as Yeh. 

15. Regarding claims 1 and 2, Patterson discloses a method for executing structured symbolic 
machine code on a microprocessor [page 111, paragraph 4; the binary contents of instructions 
represent logical processor entities such as registers and addresses in memory], wherein said 
microprocessor is part of a data processing system containing a memory system [page 111, 
paragraph 4; page 541, paragraph 2; the computer contains a memory system], where said 
memory system is defined to have a memory hierarchy [page 541, paragraph 2; the memory 
system is implemented as a memory hierarchy] containing a register file [page 512, Figure 6.58; 
the processor contains a register file], a data cache [page 541, paragraph 3; the processor 
includes a cache implemented using SRAM], and a main memory [page 541, paragraph 3; the 
processor contains a main memory implemented using DRAM], where said microprocessor has 
an instruction set containing one or more instructions of which an operand may specify a 
symbolic variable [page 111, paragraph 4; the binary contents of instructions represent logical 
processor entities such as registers and addresses in memory], where said structured symbolic 
machine code contains one or more regions [page 118; instructions contain multiple fields], 
where one of said regions contains symbolic machine code containing information, where said 
information contains a symbolic constant of said region and the value of the said symbolic 
constant [page 118, paragraph 6; the I-type instruction, used by data transfer instructions, 
contains an address field for specifying with an immediate, or constant, the address in memory 
that will be accessed], where said information may be stored into a memory [page 512, Figure 
6.58; instructions may be stored in an instruction memory], where the symbolic machine code 
contained in each of said regions contains an instruction of which one operand specifies a 
symbolic variable [page 118, paragraph 6; the address field of an I-type instruction represents the 
address at which the memory hierarchy will be accessed], where the symbolic variable specifies 
one or more entries of a memory other than a register file of said microprocessor [page 118, 
paragraph 6; the address field of an I-type instruction represents the address at which the 
memory hierarchy will be accessed], where said entries are used by the microprocessor in order 
to determine the addresses within the memory hierarchy where the values of said symbolic 
variables may be stored to or loaded from during execution of said structured symbolic machine 
code [page 118; when a memory access instruction is executed, the address field is used to 
determine the address at which the memory hierarchy will be accessed]. 

Patterson does not explicitly disclose that the microprocessor is able to perform 
speculative branch prediction, where said speculative branch prediction is based on a branch 
history which may store outcomes of branches which are not yet resolved at the point in time 
when a branch prediction is being made, where unresolved branch outcomes may update counter 
states within the branch history, and where said counter states may concern counter states stored 
in a pattern history table. 

Yeh discloses a microprocessor that is able to perform speculative branch prediction 
[abstract], where said speculative branch prediction is based on a branch history which may store 
outcomes of branches which are not yet resolved at the point in time when a branch prediction is 
being made [page 127, section 3.1, paragraph 2; the branch history is updated speculatively 
before the results of the branch are known], where unresolved branch outcomes may update 
counter states within the branch history [page 126, second column, lines 5-11; a branch history 
counter is updated using the predicted results of branch instructions]. Yeh teaches that using a 
branch predictor increases the performance of a processor [abstract] and that speculatively 
updating branch history counters can increase the accuracy of the predictor [page 127, section 
3.1, paragraph 2], resulting in a further increase in processor performance. 

It would have been obvious to one of ordinary skill in the art at the time the invention 
was made to perform speculative branch prediction using speculatively updated branch history 
counters in the processor of Patterson because Yeh discloses such a branch predictor and teaches 
that it can improve the performance of a processor [abstract; page 127, section 3.1, paragraph 2]. 

16. Regarding claim 3, Patterson in view of Yeh discloses a method as claimed in claim 1, 
wherein one of said regions contains symbolic machine code containing information, where said 
information contains a symbolic constant of said region and the value of the said symbolic 
constant [Patterson, page 118, paragraph 6; the I-type instruction, used by data transfer 
instructions, contains an address field for specifying with an immediate, or constant, the address 
in memory that will be accessed], wherein the same one of said regions contains a symbolic 
variable of said region [page 118, paragraph 6; the I-type instruction contains an opcode field to 
indicate that it is of the I-type format] as well as a label specifying a dependence group which the 
symbolic variable pertains to [page 118, paragraphs 1-2, 6; the instruction contains a rs field that 
is used to specify the address that the instruction is dependent on]. 

Application/Control Number: 10/521,585 
Art Unit: 2183 

17. Regarding claim 4, Patterson in view of Yeh discloses a method as claimed in claim 2, 
wherein one of said regions contains symbolic machine code containing information, where said 
information contains a symbolic constant of said region and the value of the said symbolic 
constant [Patterson, page 118, paragraph 6; the I-type instruction, used by data transfer 
instructions, contains an address field for specifying with an immediate, or constant, the address 
in memory that will be accessed], wherein the same one of said regions contains a symbolic 
variable [page 118, paragraphs 1-2, 6; the instruction contains an opcode field to indicate that it 
is of the I-type format] that is used to determine one or more labels specifying each a dependence 
group which the symbolic variable pertains to [page 118, paragraphs 1-2, 6; the instruction 
contains a rs field that is used to specify the address that the instruction is dependent on]. 

18. Regarding claims 7-10, Patterson in view of Yeh discloses the claimed subject matter. 
Claims 7-10 recite numerous characteristics that the microprocessor and the instructions of the 
microprocessor may have. Consider the difference between claim 1, which states 
"where said region (R1) may further contain one or more of the following information: 
information (IF3)" ... "information (IF6)", and claim 3, which depends on claim 1 and states 
"where said region (R1) further contains one or more of the following information: information 
(IF3)" ... "information (IF6)". The only difference between the two limitations is that claim 1 
states "may further contain" where claim 3 states "further contains". It is clear from this 
example that Applicant intends for the modifier "may" to indicate that it is possible for any 
subsequent limitations to exist in the system, but they do not necessarily have to exist. If this 
were not the case, claim 3 (and claim 4, for similar reasons) would be an improper dependent 
claim for failing to further limit claim 1. Using this interpretation, the subject matter of claims 7-10 
does not require any of the limitations regarding the microprocessor or the instructions, but 
rather only requires that the limitations be possible. Because there is no evidence in either 
Patterson or Yeh that precludes the inclusion of the claimed subject matter in the processor of 
Patterson in view of Yeh, such a processor reads on claims 7-10. 



Conclusion 

19. The prior art made of record and not relied upon is considered pertinent to Applicant's 
disclosure. The cited references relate closely to the subject matter of the present application 
and should be fully considered in any response to this Office Action. 

Any inquiry concerning this communication or earlier communications from the 
examiner should be directed to Corey S. Faherty whose telephone number is (571) 270-1319. 
The examiner can normally be reached on Monday-Thursday and every other Friday, 7:00-4:30. 

If attempts to reach the examiner by telephone are unsuccessful, the examiner's 
supervisor, Eddie Chan, can be reached at (571) 272-4162. The fax phone number for the 
organization where this application or proceeding is assigned is 571-273-8300. 

Information regarding the status of an application may be obtained from the Patent 
Application Information Retrieval (PAIR) system. Status information for published applications 
may be obtained from either Private PAIR or Public PAIR. Status information for unpublished 
applications is available through Private PAIR only. For more information about the PAIR 
system, see http://pair-direct.uspto.gov. Should you have questions on access to the Private PAIR 
system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would 
like assistance from a USPTO Customer Service Representative or access to the automated 
information system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000. 


Corey S Faherty 
Examiner 
Art Unit 2183 

Programming languages have simple variables that contain single data elements 
as in these examples, but they also have more complex data structures 
such as arrays. These complex data structures can contain many more data 
elements than there are registers in a machine. How can a computer represent 
and access such large structures? 

Recall the five components of a computer introduced in Chapter 1 and de- 
picted on page 105. The processor can keep only a small amount of data in reg- 
isters, but computer memory contains millions of data elements. Hence data 
structures, such as arrays, are kept in memory. 

As explained above, arithmetic operations occur only on registers in MIPS 
instructions; thus MIPS must include instructions that transfer data between 
memory and registers. Such instructions are called data transfer instructions. To 
access a word in memory, the instruction must supply the memory address. 
Memory is just a large, single-dimensional array, with the address acting as the 
index to that array, starting at 0. For example, in Figure 3.2, the address of the 
third data element is 2, and the value of Memory[2] is 10. 

The data transfer instruction that moves data from memory to a register is 
traditionally called load. The format of the load instruction is the name of the 
operation followed by the register to be loaded, then a constant and register 
used to access memory. The memory address is formed by the sum of the constant 
portion of the instruction and the contents of the second register. The 
actual MIPS name for this instruction is lw, standing for load word. 
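
The address calculation described above can be sketched in a few lines of Python. This is only an illustration under the simplified model of Figure 3.2, where memory is a flat array indexed by address; the function name `load_word` and the register names are hypothetical and do not come from the text.

```python
# Simplified model of Figure 3.2: memory is a flat array indexed by address.
# (Real MIPS addressing is byte-based; this sketch ignores that.)
memory = [1, 101, 10, 100]

def load_word(registers, dest, offset, base):
    """Model lw: registers[dest] = memory[offset + registers[base]]."""
    address = offset + registers[base]   # constant plus contents of the base register
    registers[dest] = memory[address]
    return address

# Hypothetical register names, chosen only for this illustration.
registers = {"r1": 0, "r2": 1}
address = load_word(registers, "r1", 1, "r2")
```

With a constant of 1 and a base register holding 1, the load reads Memory[2], the third data element.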



[Figure 3.2: Processor connected to Memory by Address and Data lines. Memory contents by address: 0 → 1, 1 → 101, 2 → 10, 3 → 100.] 

FIGURE 3.2 Memory addresses and contents of memory at those locations. This is a simplification 
of the MIPS addressing; Figure 3.3 shows MIPS addressing for sequential words in 
memory. 

Chapter 6 Enhancing Performance with Pipelining 

FIGURE 6.58 A superscalar datapath. The superscalar additions are highlighted: another 32 bits from instruction 
memory, two more read ports and one more write port on the register file, and another ALU. Assume the bottom ALU handles 
address calculations for data transfers and the top ALU handles everything else. 



registers for the ALU operation and two more for a store, and also one write 
port for an ALU operation and one write port for a load. Since the ALU is tied 
up for the ALU operation, we also need a separate adder to calculate the effec- 
tive address for data transfers. Without these extra resources, our superscalar 
pipeline would be hindered by structural hazards. 

There is another difficulty that may limit the effectiveness of a superscalar 
pipeline. In our simple MIPS pipeline, loads have a latency of 1 clock cycle, 
which prevents one instruction from using the result without stalling. In the 
superscalar pipeline, the result of a load instruction cannot be used on the next 
clock cycle. This means that the next two instructions cannot use the load result 
without stalling. To effectively exploit the parallelism available in a super- 
scalar processor, more ambitious compiler or hardware scheduling techniques 
are needed, as well as more complex instruction decoding. 

Alternative Implementations of Two-Level Adaptive Branch Prediction 



Tse-Yu Yeh and Yale N. Patt 
Department of Electrical Engineering and Computer Science 
The University of Michigan 
Ann Arbor, Michigan 48109-2122 



Abstract 

As the issue rate and depth of pipelining of high perfor- 
mance Superscalar processors increase, the importance 
of an excellent branch predictor becomes more vital to 
delivering the potential performance of a wide-issue, 
deep pipelined microarchitecture. We propose a new 
dynamic branch predictor (Two-Level Adaptive Branch 
Prediction) that achieves substantially higher accuracy 
than any other scheme reported in the literature. The 
mechanism uses two levels of branch history information 
to make predictions, the history of the last k branches 
encountered, and the branch behavior for the last s oc- 
currences of the specific pattern of these k branches. We 
have identified three variations of the Two-Level Adap- 
tive Branch Prediction, depending on how finely we re- 
solve the history information gathered. We compute the 
hardware costs of implementing each of the three varia- 
tions, and use these costs in evaluating their relative ef- 
fectiveness. We measure the branch prediction accuracy 
of the three variations of Two-Level Adaptive Branch 
Prediction, along with several other popular proposed 
dynamic and static prediction schemes, on the SPEC 
benchmarks. We show that the average prediction ac- 
curacy for Two-Level Adaptive Branch Prediction is 97 
percent, while the other known schemes achieve at most 
94.4 percent average prediction accuracy. We measure 
the effectiveness of different prediction algorithms and 
different amounts of history and pattern information. 
We measure the costs of each variation to obtain the 
same prediction accuracy. 

1 Introduction 

As the issue rate and depth of pipelining of high per- 
formance Superscalar processors increase, the amount 
of speculative work due to branch prediction becomes 
much larger. Since all such work must be thrown away 
if the prediction is incorrect, an excellent branch pre- 
dictor is vital to delivering the potential performance of 
a wide-issue, deep pipelined microarchitecture. Even a 
prediction miss rate of 5 percent results in a substantial 
loss in performance due to the number of instructions 
fetched each cycle and the number of cycles these in- 
structions are in the pipeline before an incorrect branch 
prediction becomes known. 

The literature is full of suggested branch prediction 
schemes [6, 13, 14, 17]. Some are static in that they use 
opcode information and profiling statistics to make predictions. 
Others are dynamic in that they use run-time 
execution history to make predictions. Static schemes 
can be as simple as always predicting that the branch 
will be taken, or can be based on the opcode, or on the 
direction of the branch, as in "if the branch is backward, 
predict taken, if forward, predict not taken" [17]. This 
latter scheme is effective for loop intensive code, but 
does not work well for programs where the branch behavior 
is irregular. Also, profiling [6, 13] can be used to 
predict branches by measuring the tendency of a branch 
on sample data sets and presetting a static prediction 
bit in the opcode according to that tendency. Unfor- 
tunately, branch behavior for the sample data may be 
very different from the data that appears at run-time. 

Dynamic branch prediction also can be as simple as in 
keeping track only of the last execution of that branch 
instruction and predicting the branch will behave the 
same way, or it can be elaborate as in maintaining 
very large amounts of history information. In all cases, 
the fact that the dynamic prediction is being made on 
the basis of run-time history information implies that 
substantial additional hardware is required. J. Smith 
[17] proposed utilizing a branch target buffer to store, 
for each branch, a two-bit saturating up-down counter 
which collects and subsequently bases its prediction on 
branch history information about that branch. Lee and 
A. Smith proposed [14] a Static Training method which 
uses statistics gathered prior to execution time coupled 
with the history pattern of the last k run-time execu- 
tions of the branch to make the next prediction as to 
which way that branch will go. The major disadvantage 
of Static Training methods has been mentioned above 
with respect to profiling; the pattern history statistics 
gathered for the sample data set may not be applicable 
to the data that appears at run-time. 

In this paper we propose a new dynamic branch pre- 
dictor that achieves substantially higher accuracy than 
any other scheme reported in the literature. The mech- 
anism uses two levels of branch history information to 
make predictions. The first level is the history of the 
last k branches encountered. (Variations of our scheme 
reflect whether this means the actual last k branches en- 
countered, or the last k occurrences of the same branch 
instruction.) The second level is the branch behavior 
for the last s occurrences of the specific pattern of these 
k branches. Prediction is based on the branch behavior 
for the last s occurrences of the pattern in question. 

For example, suppose, for k = 8, the last k branches 
had the behavior 11100101 (where 1 represents that the 
branch was taken, 0 that the branch was not taken). 
Suppose further that s = 6, and that in each of the last 
six times the previous eight branches had the pattern 
11100101, the branch alternated between taken and not 
taken. Then the second level would contain the history 
101010. Our branch predictor would predict "taken." 

The history information for level 1 and the pattern 
information for level 2 are collected at run time, eliminating 
the above mentioned disadvantages of the Static 
Training method. We call our method Two-Level Adaptive 
Branch Prediction. We have identified three variations 
of Two-Level Adaptive Branch Prediction, depending 
on how finely we resolve the history information 
gathered. We compute the hardware costs of implementing 
each of the three variations, and use these 
costs in evaluating their relative effectiveness. 

Using trace-driven simulation of nine of the ten SPEC 
benchmarks¹, we measure the branch prediction accuracy 
of the three variations of Two-Level Adaptive 
Branch Prediction, along with several other popular 
proposed dynamic and static prediction schemes. We 
measure the effectiveness of different prediction algo- 
rithms and different amounts of history and pattern 
information. We measure the costs of each variation 
to obtain the same prediction accuracy. Finally we 
compare the Two-Level Adaptive branch predictors to 
the several popular schemes available in the literature. 
We show that the average prediction accuracy for Two- 
Level Adaptive Branch Prediction is about 97 percent, 
while the other schemes achieve at most 94.4 percent 
average prediction accuracy. 

This paper is organized in six sections. Section two 
introduces our Two-Level Adaptive Branch Prediction 
and its three variations. Section three describes the cor- 
responding implementations and computes the associ- 
ated hardware costs. Section four discusses the Simula- 
tion model and traces used in this study. Section five 
reports the simulation results and our analysis. Section 
six contains some concluding remarks. 

2 Definition of Two-Level Adaptive Branch 
Prediction 

2.1 Overview 

Two-Level Adaptive Branch Prediction uses two levels 
of branch history information to make predictions. The 
first level is the history of the last k branches encountered. 
(Variations of our scheme reflect whether this 
means the actual last k branches encountered, or the 
last k occurrences of the same branch instruction.) The 
second level is the branch behavior for the last s occurrences 
of the specific pattern of these k branches. 
Prediction is based on the branch behavior for the last 
s occurrences of the pattern in question. 

To maintain the two levels of information, Two-Level 
Adaptive Branch Prediction uses two major data struc- 
tures, the branch history register (HR) and the pattern 
history table (PHT), see Figure 1. Instead of accumu- 
lating statistics by profiling programs, the information 
on which branch predictions are based is collected at 
run-time by updating the contents of the history regis- 
ters and the pattern history bits in the entries of the 
pattern history table depending on the outcomes of the 
branches. The history register is a k-bit shift register 
which shifts in bits representing the branch results of 
the most recent k branches. 

Figure 1: Structure of Two-Level Adaptive Branch Prediction. 

If the branch was taken, then a "1" is recorded; if 
not, a "0" is recorded. Since there are k bits in the 
history register, at most $2^k$ different patterns appear in 
the history register. For each of these $2^k$ patterns, there 
is a corresponding entry in the pattern history table 
which contains branch results for the last s times the 
preceding k branches were represented by that specific 
content of the history register. 
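
As a rough illustration of the history register described above, the following Python sketch models a k-bit shift register that records taken branches as 1 and not-taken branches as 0. The function name `shift_in` and the choice of k = 8 (borrowed from the paper's introductory example) are assumptions made only for this sketch; the paper describes hardware, not software.

```python
K = 8  # history length; 8 matches the example pattern 11100101 from the introduction

def shift_in(history, taken, k=K):
    """Shift one branch outcome into the least significant bit of a k-bit register."""
    return ((history << 1) | (1 if taken else 0)) & ((1 << k) - 1)

# Feed in the outcomes of the last eight branches, oldest first.
history = 0
for outcome in [1, 1, 1, 0, 0, 1, 0, 1]:
    history = shift_in(history, outcome)

pattern = format(history, "08b")   # register contents as a bit string
num_pht_entries = 1 << K           # one pattern history table entry per possible pattern
```

The register contents then serve directly as the index into the pattern history table.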

When a conditional branch B is being predicted, the content of its history register, HR, denoted as 
$R_{c-k}R_{c-k+1} \ldots R_{c-1}$, is used to address the pattern history table. The pattern history 
bits $S_c$ in the addressed entry $PHT_{R_{c-k}R_{c-k+1} \ldots R_{c-1}}$ in the pattern history 
table are then used for predicting the branch. The prediction of the branch is 

$$z_c = \lambda(S_c), \qquad (1)$$

where $\lambda$ is the prediction decision function. 

After the conditional branch is resolved, the outcome $R_c$ is shifted left into the history register HR 
in the least significant bit position and is also used to update the pattern history bits in the pattern 
history table entry $PHT_{R_{c-k}R_{c-k+1} \ldots R_{c-1}}$. After being updated, the content of the 
history register becomes $R_{c-k+1}R_{c-k+2} \ldots R_c$ and the state represented by the 
pattern history bits becomes $S_{c+1}$. The transition of the 
pattern history bits in the pattern history table entry 
is done by the state transition function $\delta$, which takes 
in the old pattern history bits and the outcome of the 
branch as inputs to generate the new pattern history 
bits. Therefore, the new pattern history bits $S_{c+1}$ become 

$$S_{c+1} = \delta(S_c, R_c). \qquad (2)$$

A straightforward combinational logic circuit is used to 
implement the function $\delta$ to update the pattern history 
bits in the entries of the pattern history table. The transition 
function $\delta$, predicting function $\lambda$, pattern history 
bits $S$ and the outcome $R$ of the branch comprise a 
finite-state Moore machine, characterized by equations 
1 and 2. 

¹ The Nasa7 benchmark was not simulated because this benchmark 
consists of seven independent loops. It takes too long to 
simulate the branch behavior of these seven kernels, so we omitted 
these loops. 

State diagrams of the finite-state Moore machines 
used in this study for updating the pattern history in 
the pattern history table entry and for predicting which 
path the branch will take are shown in Figure 2. The 
automaton Last-Time stores in the pattern history only 
the outcome of the last execution of the branch when 
the history pattern appeared. The next time the same 
history pattern appears the prediction will be what hap- 
pened last time. Only one bit is needed to store that 
pattern history information. The automaton A1 records 
the results of the last two times the same history pat- 
tern appeared. Only when there is no taken branch 
recorded, the next execution of the branch when the 
history register has the same history pattern will be 
predicted as not taken; otherwise, the branch will be 
predicted as taken. The automaton A2 is a saturating 
up-down counter, similar to the automaton used in J. 
Smith's branch target buffer design for keeping branch 
history [17]. 
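
The first two automata can be sketched as pairs of a state transition function and a prediction decision function. The Python below is a minimal sketch based only on the prose description above; the function names, the integer state encoding, and the `step` helper are assumptions made for illustration, not the paper's hardware design.

```python
# Each automaton is a pair (delta, lam):
#   delta(state, outcome) -> new pattern history bits
#   lam(state)            -> prediction (1 = taken, 0 = not taken)

def last_time_delta(state, outcome):
    return outcome                 # remember only the most recent outcome (one bit)

def last_time_lam(state):
    return state                   # predict whatever happened last time

def a1_delta(state, outcome):
    return ((state << 1) | outcome) & 0b11   # keep the last two outcomes

def a1_lam(state):
    return 1 if state != 0 else 0  # not taken only when no taken branch is recorded

def step(delta, lam, state, outcome):
    """Return (prediction made from the old state, new state after the outcome)."""
    return lam(state), delta(state, outcome)
```

Because the prediction depends only on the current state, both sketches behave as Moore machines: the outcome affects the next state but never the prediction already made.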




Figure 2: State diagrams of the finite-state Moore ma- 
chines used for making prediction and updating the pat- 
tern history table entry. 

In J. Smith's design the 2-bit saturating up-down 
counter keeps track of the branch history of a certain 
branch. The counter is incremented when the branch 



is taken and is decremented when the branch is not 
taken. The branch path of the next execution of the 
branch will be predicted as taken when the counter value 
is greater than or equal to two; otherwise, the branch 
will be predicted as not taken. In Two-Level Adap- 
tive Branch Prediction, the 2-bit saturating up-down 
counter keeps track of the history of a certain history 
pattern. The counter is incremented when the result of 
a branch, whose history register content is the same as 
the pattern history table entry index, is taken; other- 
wise, the counter is decremented. The next time the 
branch has the same history register content which ac- 
cesses the same pattern history table entry, the branch is 
predicted taken if the counter value is greater or equal 
to two; otherwise, the branch is predicted not taken. 
Automata A3 and A4 are variations of A2.
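
The update and prediction rules of Last-Time, A1, and A2 described above can be sketched in a few lines. This is only an illustration under stated assumptions: the state encodings (a 2-bit shift register for A1, counter values 0 to 3 for A2) and all function names are hypothetical choices, not taken from Figure 2. A3 and A4 are omitted because their state diagrams are not reproduced in the text.

```python
def last_time_next(state: int, taken: bool) -> int:
    """Last-Time: the 1-bit state is simply the most recent outcome."""
    return 1 if taken else 0


def last_time_predict(state: int) -> bool:
    return state == 1


def a1_next(state: int, taken: bool) -> int:
    """A1 (assumed encoding): 2-bit shift register of the last two outcomes."""
    return ((state << 1) | int(taken)) & 0b11


def a1_predict(state: int) -> bool:
    # Predict not taken only when neither recorded outcome was taken.
    return state != 0


def a2_next(state: int, taken: bool) -> int:
    """A2: 2-bit saturating up-down counter with values 0..3."""
    return min(state + 1, 3) if taken else max(state - 1, 0)


def a2_predict(state: int) -> bool:
    # Predict taken when the counter value is greater than or equal to two.
    return state >= 2
```

Each pattern history table entry would hold one such state; the predictor reads it with the `*_predict` function and replaces it with the `*_next` result once the branch outcome is known.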

Both Static Training [14] and Two-Level Adaptive 
Branch Prediction are dynamic branch predictors, be- 
cause their predictions are based on run-time informa- 
tion, i.e. the dynamic branch history. The major dif- 
ference between these two schemes is that the pattern 
history information in the pattern history table changes 
dynamically in Two-Level Adaptive Branch Prediction 
but is preset in Static Training from profiling. In Static 
Training, the input to the prediction decision function, 
A, for a given branch history pattern is known before 
execution. Therefore, the output of A is determined be- 
fore execution for a given branch history pattern. That 
is, the same branch predictions are made if the same 
history pattern appears at different times during execution.
Two-Level Adaptive Branch Prediction, on the
other hand, updates the pattern history information 
kept in the pattern history table with the actual results 
of branches. As a result, given the same branch his- 
tory pattern, different pattern history information can 
be found in the pattern history table; therefore, there 
can be different inputs to the prediction decision func- 
tion for Two-Level Adaptive Branch Prediction. Predic- 
tions of Two-Level Adaptive Branch Prediction change 
adaptively as the program executes.

Since the pattern history bits change in Two-Level 
Adaptive Branch Prediction, the predictor can adjust to 
the current branch execution behavior of the program to 
make proper predictions. With these run-time updates, 
Two-Level Adaptive Branch Prediction can be highly 
accurate over many different programs and data sets. 
Static Training, on the contrary, may not predict well 
if changing data sets brings about different execution 
behavior. 

2.2 Alternative Implementations of Two-Level 
Adaptive Branch Prediction 

There are three alternative implementations of the Two- 
Level Adaptive Branch Prediction, as shown in Figure 
3. They are differentiated as follows: 

Two-Level Adaptive Branch Prediction Using a
Global History Register and a Global Pattern 
History Table (GAg) 

In GAg, there is only a single global history register
(GHR) and a single global pattern history table
(GPHT) used by the Two-Level Adaptive Branch Prediction.
All branch predictions are based on the same
global history register and global pattern history table
which are updated after each branch is resolved. This
variation therefore is called Global Two-Level Adaptive
Branch Prediction using a global pattern history table
(GAg).

Figure 3: Global view of three variations of Two-Level
Adaptive Branch Prediction.

Since the outcomes of different branches update the 
same history register and the same pattern history table, 
the information of both branch history and pattern his- 
tory is influenced by results of different branches. The 
prediction for a conditional branch in this scheme is ac- 
tually dependent on the outcomes of other branches. 
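
A minimal sketch of a GAg predictor may make the structure concrete. It assumes the A2 counter in every pattern history table entry and an all-ones initial history register; the class and attribute names are hypothetical.

```python
class GAg:
    """One global history register indexing one global pattern history table."""

    def __init__(self, k: int):
        self.mask = (1 << k) - 1
        self.ghr = self.mask            # assumption: history starts as all taken
        self.gpht = [3] * (1 << k)      # 2^k two-bit counters, initialised to state 3

    def predict(self) -> bool:
        # The branch address plays no role: only the global history selects the entry.
        return self.gpht[self.ghr] >= 2

    def update(self, taken: bool) -> None:
        c = self.gpht[self.ghr]
        self.gpht[self.ghr] = min(c + 1, 3) if taken else max(c - 1, 0)
        self.ghr = ((self.ghr << 1) | int(taken)) & self.mask
```

Because `predict` takes no address, every branch in the program reads and writes the same two structures, which is the source of the interference described above.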

Two-Level Adaptive Branch Prediction Using a 
Per-address Branch History Table and a Global 
Pattern History Table (PAg) 

In order to reduce the interference in the first-level
branch history information, one history register is as- 
sociated with each distinct static conditional branch to 
collect branch history information individually. The his- 
tory registers are contained in a per-address branch his- 
tory table (PBHT) in which each entry is accessible by 
one specific static branch instruction and is accessed by 
branch instruction addresses. Since the branch history 
is kept for each distinct static conditional branch indi- 
vidually and all history registers access the same global 
pattern history table, this variation is called Per-address 
Two-Level Adaptive Branch Prediction using a global 
pattern history table (PAg). 

The execution results of a static conditional branch 
update the branch's own history register and the global 
pattern history table. The prediction for a conditional 
branch is based on the branch's own history and the 
pattern history bits in the global pattern history table 
entry, indexed by the content of the branch's history 
register. Since all branches update the same pattern 
history table, the pattern history interference still exists. 
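
The PAg organisation can be sketched as follows, assuming an idealised per-address branch history table (a dictionary keyed by branch address) and A2 counters in the global pattern history table. All names are hypothetical.

```python
class PAg:
    """Per-address history registers sharing one global pattern history table."""

    def __init__(self, k: int):
        self.mask = (1 << k) - 1
        self.pbht = {}                  # branch address -> k-bit history register
        self.gpht = [3] * (1 << k)      # shared by all branches

    def _history(self, addr: int) -> int:
        # Assumption: a branch seen for the first time starts with all-ones history.
        return self.pbht.get(addr, self.mask)

    def predict(self, addr: int) -> bool:
        return self.gpht[self._history(addr)] >= 2

    def update(self, addr: int, taken: bool) -> None:
        hist = self._history(addr)
        c = self.gpht[hist]
        self.gpht[hist] = min(c + 1, 3) if taken else max(c - 1, 0)
        self.pbht[addr] = ((hist << 1) | int(taken)) & self.mask
```

Two branches that reach the same history pattern still update the same `gpht` entry, so the second-level interference remains.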



Two-Level Adaptive Branch Prediction Using 
Per-address Branch History Table and Per- 
address Pattern History Tables (PAp) 



In order to completely remove the interference in both
levels, each static branch has its own pattern history table;
the set of these tables is called the per-address pattern
history tables (PPHT). Therefore, a per-address history register
and a per-address pattern history table are associated 
with each static conditional branch. All history regis- 
ters are grouped in a per-address branch history table. 
Since this variation of Two-Level Adaptive Branch Prediction
keeps separate history and pattern information
for each distinct static conditional branch, it is called 
Per-address Two-Level Adaptive Branch Prediction us- 
ing Per-address pattern history tables (PAp). 
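
A corresponding sketch for PAp gives every static branch its own history register and its own pattern history table. It assumes idealised (unbounded) tables and A2 counters; the names are hypothetical.

```python
class PAp:
    """Per-address history registers and per-address pattern history tables."""

    def __init__(self, k: int):
        self.k = k
        self.mask = (1 << k) - 1
        self.pbht = {}      # branch address -> k-bit history register
        self.ppht = {}      # branch address -> that branch's own pattern table

    def _tables(self, addr: int):
        hist = self.pbht.setdefault(addr, self.mask)
        pht = self.ppht.setdefault(addr, [3] * (1 << self.k))
        return hist, pht

    def predict(self, addr: int) -> bool:
        hist, pht = self._tables(addr)
        return pht[hist] >= 2

    def update(self, addr: int, taken: bool) -> None:
        hist, pht = self._tables(addr)
        pht[hist] = min(pht[hist] + 1, 3) if taken else max(pht[hist] - 1, 0)
        self.pbht[addr] = ((hist << 1) | int(taken)) & self.mask
```

With a 1-bit history, an always-taken branch and an alternating branch disagree about what follows the pattern "taken"; because each branch owns its table here, neither disturbs the other.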

3 Implementation Considerations 

3.1 Pipeline Timing of Branch Prediction and 
Information Update 

Two-Level Adaptive Branch Prediction requires two sequential
table accesses to make a prediction. It is difficult
to squeeze the two accesses into one cycle. High
performance requires that prediction be made within 
one cycle from the time the branch address is known. 
To satisfy this requirement, the two sequential accesses 
are performed in two different cycles as follows: When a 
branch result becomes known, the branch's history reg- 
ister is updated. In the same cycle, the pattern history 
table can be accessed for the next prediction with the 
updated history register contents derived by appending 
the result to the old history. The prediction fetched 
from the pattern history table is then stored along with 
the branch's history in the branch history table. The 
pattern history can also be updated at that time. The 
next time that branch is encountered, the prediction is 
available as soon as the branch history table is accessed. 
Therefore, only one cycle latency is incurred from the 
time the branch address is known to the time the pre- 
diction is available. 

Sometimes the previous branch results may not be 
ready before the prediction of a subsequent branch takes 
place. If the obsolete branch history is used for making 
the prediction, the accuracy is degraded. In such a case, 
the predictions of the previous branches can be used to 
update the branch history. Since the prediction accuracy
of Two-Level Adaptive Branch Prediction is very
high, prediction is enhanced by updating the branch history
speculatively. The update timing for the pattern
history table, on the other hand, is not as critical as that 
of the branch history; therefore, its update can be de- 
layed until the branch result is known. With speculative 
updating, when a misprediction occurs, the branch his- 
tory can either be reinitialized or repaired depending on 
the hardware budget available to the branch predictor. 
Also, if two instances of the same static branch occur 
in consecutive cycles, the latency of prediction can be 
reduced for the second branch by using the prediction 
fetched from the pattern history table directly. 
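
The speculative update with repair can be sketched as below. The text only states that the history may be reinitialized or repaired after a misprediction; the checkpoint list used here is one hypothetical way to repair it, and all names are assumptions.

```python
class SpeculativeHistory:
    """History register updated with predictions and repaired on a misprediction."""

    def __init__(self, k: int):
        self.mask = (1 << k) - 1
        self.value = self.mask
        self.checkpoints = []   # history saved before each unresolved branch

    def speculate(self, predicted: bool) -> None:
        # Shift in the prediction instead of waiting for the real outcome.
        self.checkpoints.append(self.value)
        self.value = ((self.value << 1) | int(predicted)) & self.mask

    def resolve_oldest(self, predicted: bool, actual: bool) -> None:
        saved = self.checkpoints.pop(0)
        if predicted != actual:
            # Repair: drop all younger speculative bits, then shift in the real outcome.
            self.checkpoints.clear()
            self.value = ((saved << 1) | int(actual)) & self.mask
```

The cheaper alternative mentioned in the text, reinitialization, would simply reset `value` to its initial contents instead of restoring a checkpoint.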

3.2 Target Address Caching 

After the direction of a branch is predicted, there is 
still the possibility of a pipeline bubble due to the time 
it takes to generate the target address. To eliminate 
this bubble, we cache the target addresses of branches.
One extra field is required in each entry of the branch 
history table for doing this. When a branch is predicted 
taken, the target address is used to fetch the following 
instructions; otherwise, the fall-through address is used.

Caching the target addresses makes prediction in con- 
secutive cycles possible without any delay. This also 
requires the branch history table to be accessed by the 
fetching address of the instruction block rather than by 
the address of the branch in the instruction block being 
fetched because the branch address is not known until 
the instruction block is decoded. If the address hits in 
the branch history table, the prediction of the branch 
in the instruction block can be made before the instruc- 
tions are decoded. If the address misses in the branch 
history table, either there is no branch in the instruction 
block fetched in that cycle or the branch history infor- 
mation is not present in the branch history table. In this 
case, the next sequential address is used to fetch new in- 
structions. After the instructions are decoded, if there is 
a branch in the instruction block and if the instruction 
block address missed in the branch history table, static 
branch prediction is used to determine whether or not 
the new instructions fetched from the next sequential 
address should be squashed. 



3.3 Per-address Branch History Table Imple- 
mentation 

PAg and PAp branch predictors both use per-address
branch history tables in their structure. It is not feasible
to have a branch history table large enough to
hold all branches' execution history in real implementations.
Therefore, a practical approach for the per-address
branch history table is proposed here.

The per-address branch history table can be implemented
as a set-associative or direct-mapped cache. A
fixed number of entries in the table are grouped together
as a set. Within a set, a Least-Recently-Used (LRU) algorithm
is used for replacement. The lower part of a
branch address is used to index into the table and the 
higher part is stored as a tag in the entry associated 
with that branch. When a conditional branch is to be 
predicted, the branch's entry in the branch history ta- 
ble is located first. If the tag in the entry matches the 
accessing address, the branch information in the entry 
is used to predict the branch. If the tag does not match 
the address, a new entry is allocated for the branch. 
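
The lookup and allocation procedure above can be sketched as a small set-associative table with LRU replacement. The sketch ignores instruction alignment bits and keeps only a tag and a history register per entry; these simplifications and all names are assumptions.

```python
class BranchHistoryTable:
    """Set-associative branch history table with LRU replacement."""

    def __init__(self, entries: int, ways: int, k: int):
        self.sets = entries // ways
        self.ways = ways
        self.init_hist = (1 << k) - 1
        # Each set lists [tag, history] entries, most recently used first.
        self.table = [[] for _ in range(self.sets)]

    def lookup(self, addr: int):
        """Return (hit, entry); a new entry is allocated on a miss."""
        index, tag = addr % self.sets, addr // self.sets
        entries = self.table[index]
        for pos, entry in enumerate(entries):
            if entry[0] == tag:
                entries.insert(0, entries.pop(pos))   # now most recently used
                return True, entry
        if len(entries) == self.ways:
            entries.pop()                             # evict least recently used
        entry = [tag, self.init_hist]
        entries.insert(0, entry)
        return False, entry
```

Setting `ways` to 1 gives the direct-mapped configuration; `BranchHistoryTable(512, 4, 12)` would correspond to a 4-way set-associative 512-entry table with 12-bit history registers.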

In this study, both the above practical approach and 
an Ideal Branch History Table (IBHT), in which there 
is a history register for each static conditional branch, 
were simulated for Two-Level Adaptive Branch Prediction.
The branch history table was simulated with four
configurations: 4-way set-associative 512-entry, 4-way
set-associative 256-entry, direct-mapped 512-entry, and
direct-mapped 256-entry caches. The IBHT simulation
data is provided to show the accuracy loss due to the
history interference in practical branch history table
implementations.



3.4 Hardware Cost Estimates 

The chip area required for a run-time branch predic- 
tion mechanism is not inconsequential. The following 
hardware cost estimates are proposed to characterize 
the relative costs of the three variations. The branch 
history table and the pattern history table are the two 
major parts. Detailed items include storage space for 
keeping history information, prediction bits, tags, and 
LRU bits and the accessing and updating logic of the 
tables. The accessing and updating logic consists of 
comparators, MUXes, LRU bits incrementors, and ad- 
dress decoders for the branch history table, and address 
decoders and pattern history bit update circuits for the 
pattern history table. The storage space for caching tar- 
get addresses is not included in the following equations 
because it is not required for the branch predictor. 
Assumptions of these estimates are: 

• There are a address bits, a subset of which is used 
to index the branch history table and the rest are 
stored as a tag in the indexed branch history table 
entry. 

• In an entry of the branch history table, there are 
fields for branch history, an address tag, a predic- 
tion bit, and LRU bits. 

• The branch history table size is h. 

• The branch history table is $2^j$-way set-associative.

• Each history register contains $k$ bits.

• Each pattern history table entry contains $s$ bits.

• Pattern history table set size is $p$. (In PAp, $p$ is
equal to the size of the branch history table, $h$, while
in GAg and PAg, $p$ is always equal to one.)

• $C_s$, $C_d$, $C_c$, $C_m$, $C_{sh}$, $C_i$, and $C_a$ are the constant
base costs for the storage, the decoder, the comparator,
the multiplexer, the shifter, the incrementor,
and the finite-state machine.

Furthermore, $i$ is equal to $\log_2 h$ and is a non-negative
integer. When there are $k$ bits in a history register, a
pattern history table always has $2^k$ entries.

The hardware cost of Two-Level Adaptive Branch 
Prediction is as follows: 

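\[
\begin{aligned}
&\mathrm{Cost}_{\mathrm{Scheme}}(\mathrm{BHT}(h, j, k),\ p \times \mathrm{PHT}(2^k, s)) \\
&= \mathrm{Cost}_{\mathrm{BHT}}(h, j, k) + p \times \mathrm{Cost}_{\mathrm{PHT}}(2^k, s) \\
&= \{\mathrm{BHT}_{\mathrm{Storage\_Space}} + \mathrm{BHT}_{\mathrm{Accessing\_Logic}} + \mathrm{BHT}_{\mathrm{Updating\_Logic}}\} \\
&\quad + p \times \{\mathrm{PHT}_{\mathrm{Storage\_Space}} + \mathrm{PHT}_{\mathrm{Accessing\_Logic}} + \mathrm{PHT}_{\mathrm{Updating\_Logic}}\} \\
&= \{[h \times (\mathrm{Tag}_{(a-i+j)\text{-bit}} + \mathrm{HR}_{k\text{-bit}} + \mathrm{Prediction\_Bit}_{1\text{-bit}} + \mathrm{LRU\_Bits}_{j\text{-bit}})] \\
&\quad + [1 \times \mathrm{Address\_Decoder}_{i\text{-bit}} + 2^j \times \mathrm{Comparators}_{(a-i+j)\text{-bit}} + 1 \times (2^j \times 1)\text{-}\mathrm{MUX}_{k\text{-bit}}] \\
&\quad + [h \times \mathrm{Shifter}_{k\text{-bit}} + 2^j \times \mathrm{LRU\_Incrementors}_{j\text{-bit}}]\} \\
&\quad + p \times \{[2^k \times \mathrm{History\_Bits}_{s\text{-bit}}] + [1 \times \mathrm{Address\_Decoder}_{k\text{-bit}}] + [\mathrm{State\_Updater}_{s\text{-bit}}]\} \\
&= \{h \times [(a - i + j) + k + 1 + j] \times C_s \\
&\quad + [h \times C_d + 2^j \times (a - i + j) \times C_c + 2^j \times k \times C_m] \\
&\quad + [h \times k \times C_{sh} + 2^j \times j \times C_i]\} \\
&\quad + p \times \{[2^k \times s \times C_s] + [2^k \times C_d] + [s \times 2^{s+1} \times C_a]\}, \qquad a + j \ge i.
\end{aligned}
\tag{3}
\]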
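
As a rough aid, Function 3 can be evaluated numerically. The sketch below follows one reading of the final line of Function 3, assumes that h is a power of two so that i = log2(h) is an integer, and takes the base costs from a dictionary whose keys are hypothetical names for C_s, C_d, C_c, C_m, C_sh, C_i, and C_a.

```python
def cost_two_level(a: int, h: int, j: int, k: int, s: int, p: int, C: dict) -> int:
    """Evaluate the hardware cost estimate of Function 3."""
    i = h.bit_length() - 1          # i = log2(h), assuming h is a power of two
    assert a + j >= i, "the tag width a - i + j must not be negative"
    ways = 2 ** j
    tag = a - i + j
    bht_storage = h * (tag + k + 1 + j) * C["s"]
    bht_access = h * C["d"] + ways * tag * C["c"] + ways * k * C["m"]
    bht_update = h * k * C["sh"] + ways * j * C["i"]
    pht_storage = 2 ** k * s * C["s"]
    pht_access = 2 ** k * C["d"]
    pht_update = s * 2 ** (s + 1) * C["a"]
    return (bht_storage + bht_access + bht_update) + p * (
        pht_storage + pht_access + pht_update
    )
```

With all base costs set to 1 (an arbitrary choice made only for illustration), a 30-bit address, a 4-way 512-entry table, 12-bit history registers, and 2-bit pattern entries, `cost_two_level(30, 512, 2, 12, 2, 1, C)` sums a branch history table part of 26260 and a pattern history table part of 12304.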

In GAg, only one history register and one global pat- 
tern history table are used, so h and p are both equal to 
one. No tag and no branch history table accessing logic 
are necessary for the single history register. Besides, 
pattern history state updating logic is small compared 
to the other two terms in the pattern history table cost. 
Therefore, cost estimation function for GAg can be sim- 
plified from Function 3 to the following Function: 

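\[
\begin{aligned}
&\mathrm{Cost}_{\mathrm{GAg}}(\mathrm{BHT}(1, \,, k),\ 1 \times \mathrm{PHT}(2^k, s)) \\
&= \mathrm{Cost}_{\mathrm{BHT}}(1, \,, k) + 1 \times \mathrm{Cost}_{\mathrm{PHT}}(2^k, s) \\
&\approx \{[k + 1] \times C_s + k \times C_{sh}\} + \{2^k \times (s \times C_s + C_d)\}
\end{aligned}
\tag{4}
\]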

Clearly, the cost of GAg grows exponentially
with respect to the history register length.

In PAg, only one pattern history table is used, so p 
is equal to one. Since j and s are usually small com- 
pared to the other variables, by using Function 3, the 
estimated cost for PAg using a branch history table is 
as follows: 

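\[
\begin{aligned}
&\mathrm{Cost}_{\mathrm{PAg}}(\mathrm{BHT}(h, j, k),\ 1 \times \mathrm{PHT}(2^k, s)) \\
&= \mathrm{Cost}_{\mathrm{BHT}}(h, j, k) + 1 \times \mathrm{Cost}_{\mathrm{PHT}}(2^k, s) \\
&\approx \{h \times [(a + 2 \times j + k + 1 - i) \times C_s + C_d + k \times C_{sh}]\} \\
&\quad + \{2^k \times (s \times C_s + C_d)\}, \qquad a + j \ge i.
\end{aligned}
\tag{5}
\]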

The cost of a PAg scheme grows exponentially with 
respect to the history register length and linearly with 
respect to the branch history table size. 

In a PAp scheme using a branch history table as de- 
fined above, h pattern history tables are used, so p is 
equal to h. By using Function 3, the estimated cost for 
PAp is as follows: 

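\[
\begin{aligned}
&\mathrm{Cost}_{\mathrm{PAp}}(\mathrm{BHT}(h, j, k),\ h \times \mathrm{PHT}(2^k, s)) \\
&= \mathrm{Cost}_{\mathrm{BHT}}(h, j, k) + h \times \mathrm{Cost}_{\mathrm{PHT}}(2^k, s) \\
&\approx \{h \times [(a + 2 \times j + k + 1 - i) \times C_s + C_d + k \times C_{sh}]\} \\
&\quad + h \times \{2^k \times (s \times C_s + C_d)\}, \qquad a + j \ge i.
\end{aligned}
\tag{6}
\]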

When the history register is sufficiently large, the cost 
of a PAp scheme grows exponentially with respect to the 
history register length and linearly with respect to the 
branch history table size. However, the branch history 
table size becomes a more dominant factor than it is in 
a PAg scheme. 
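
The growth behaviour of the three simplified estimates can be checked numerically. The sketch below assumes unit base costs, a 30-bit address, and a 4-way 512-entry branch history table; none of these values comes from the paper, and the function names are hypothetical.

```python
def pht_cost(k: int, s: int, C: dict) -> int:
    """One pattern history table: storage plus decoder, as in Functions 4 to 6."""
    return 2 ** k * (s * C["s"] + C["d"])


def bht_cost(a: int, h: int, j: int, k: int, C: dict) -> int:
    """Simplified branch history table cost shared by Functions 5 and 6."""
    i = h.bit_length() - 1          # i = log2(h), assuming h is a power of two
    return h * ((a + 2 * j + k + 1 - i) * C["s"] + C["d"] + k * C["sh"])


def cost_gag(k: int, s: int, C: dict) -> int:
    return (k + 1) * C["s"] + k * C["sh"] + pht_cost(k, s, C)


def cost_pag(a: int, h: int, j: int, k: int, s: int, C: dict) -> int:
    return bht_cost(a, h, j, k, C) + pht_cost(k, s, C)


def cost_pap(a: int, h: int, j: int, k: int, s: int, C: dict) -> int:
    return bht_cost(a, h, j, k, C) + h * pht_cost(k, s, C)


UNIT = {"s": 1, "d": 1, "sh": 1}    # assumption: every base cost equals 1
gag = cost_gag(18, 2, UNIT)                    # 18-bit global history register
pag = cost_pag(30, 512, 2, 12, 2, UNIT)        # 12-bit per-address histories
pap = cost_pap(30, 512, 2, 6, 2, UNIT)         # 6-bit per-address histories
```

Under these assumptions the estimates are 786469 for GAg, 38400 for PAg, and 118272 for PAp: the single large pattern table dominates GAg, while the 512 small pattern tables dominate PAp.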

4 Simulation Model 

Trace-driven simulations were used in this study. A Mo- 
torola 88100 instruction level simulator is used for gen- 
erating instruction traces. The instruction and address 
traces are fed into the branch prediction simulator which 
decodes instructions, predicts branches, and verifies the 
predictions with the branch results to collect statistics 
for branch prediction accuracy. 



4.1 Description of Traces 

Nine benchmarks from the SPEC benchmark suite are 
used in this branch prediction study. Five are float- 
ing point benchmarks and four are integer benchmarks. 
The floating point benchmarks include doduc, fpppp,
matrix300, spice2g6, and tomcatv, and the integer ones
include eqntott, espresso, gcc, and li. Nasa7 is not included
because it takes too long to capture the branch
behavior of all seven kernels.

Among the five floating point benchmarks, fpppp, 
matrix300 and tomcatv have repetitive loop execution;
thus, a very high prediction accuracy is attainable, in- 
dependent of the predictors used. Doduc, spice2g6 and 
the integer benchmarks are more interesting. They have 
many conditional branches and irregular branch behav- 
ior. Therefore, it is on the integer benchmarks where a 
branch predictor's mettle is tested. 

Since this study of branch prediction focuses on the 
prediction for conditional branches, all benchmarks 
were simulated for twenty million conditional branch 
instructions except gcc, which finished before twenty
million conditional branch instructions were executed.
Fpppp, matrix300, and tomcatv were simulated for 100
million instructions because of their regular branch behavior
throughout the programs. The numbers of static
conditional branches in the instruction traces of the
benchmarks are listed in Table 1. History register hit
rate usually depends on the number of static branches 
in the benchmarks. The testing and training data sets 
for each benchmark used in this study are listed in Table 
2. 



Benchmark    Number of Static    Benchmark    Number of Static
Name         Cnd. Br.            Name         Cnd. Br.
eqntott      277                 espresso     556
gcc          6922                li           489
doduc        1149                fpppp        653
matrix300    213                 spice2g6     606
tomcatv      370

Table 1: Number of static conditional branches in each
benchmark.



Benchmark    Training             Testing
Name         Data Set             Data Set
eqntott      NA                   int_pri_3.eqn
espresso     cps                  bca
gcc          cexp.i               dbxout.i
xlisp        tower of hanoi       eight queens
doduc        tiny doducin         doducin
fpppp        NA                   natoms
matrix300    NA                   Built-in
spice2g6     short greycode.in    greycode.in
tomcatv      NA                   Built-in

Table 2: Training and testing data sets of benchmarks.

In the traces generated with the testing data sets,
about 24 percent of the dynamic instructions for the 
integer benchmarks and about 5 percent of the dy- 
namic instructions for the floating point benchmarks 
are branch instructions. Figure 4 shows about 80 per- 
cent of the dynamic branch instructions are conditional 
branches; therefore, the prediction mechanism for con- 
ditional branches is the most important among the pre- 
diction mechanisms for different classes of branches. 

Figure 4: Distribution of dynamic branch instructions.



4.2 Characterization of Branch Predictors 

The three variations of Two-Level Adaptive Branch
Prediction were simulated with several configurations.
Other known dynamic and static branch
predictors were also simulated. The configurations
of the dynamic branch predictors are shown
in Table 3. In order to distinguish the different
schemes we analyzed, the following naming convention
is used: Scheme(History(Size, Associativity,
Entry_Content), Pattern_Table_Set_Size x Pattern(Size,
Entry_Content), Context_Switch). If a predictor
does not have a certain feature in the naming convention,
the corresponding field is left blank.

Scheme specifies the scheme, for example, GAg, 
PAg, PAp or Branch Target Buffer design (BTB) 
[17]. In History(Size, Associativity, Entry_Content),
History is the entity used to keep history information
of branches, for example, HR (a single history register),
IBHT, or BHT. Size specifies the number of entries in
that entity, Associativity is the associativity of the table,
and Entry_Content specifies the content in each
branch history table entry. When Associativity is set
to 1, the branch history table is direct-mapped. The
content of an entry in the branch history table can be 
any automaton shown in Figure 2 or simply a history 
register. 

In Pattern_Table_Set_Size x Pattern(Size, Entry_Content),
Pattern_Table_Set_Size is the
number of pattern history tables used in the scheme,
Pattern is the implementation for keeping pattern history
information, Size specifies the number of entries in
the implementation, and Entry_Content specifies the
content in each entry. The content of an entry in the
pattern history table can be any automaton shown in
Figure 2. For Branch Target Buffer designs, the Pattern
part is not included, because there is no pattern history
information kept in their designs. Context_Switch is
a flag for context switches. When Context_Switch is
specified as c, context switches are simulated. If it is
not specified, no context switches are simulated.

Since there are more taken branches than not taken 
branches according to our simulation results, a history 
register in the branch history table is initialized to all 1's
when a miss on the branch history table occurs. After 
the result of the branch which causes the branch history 
table miss is known, the result bit is extended through- 
out the history register. A context switch results in 
flushing and reinitialization of the branch history table. 
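
The initialization policy for a history register after a branch history table miss can be written out as a small sketch. The function names are hypothetical, and k is the history register length.

```python
def history_on_miss(k: int) -> int:
    """On a miss, the new history register starts as all 1's (all taken)."""
    return (1 << k) - 1


def history_after_first_result(k: int, taken: bool) -> int:
    """Once the missing branch resolves, its result bit fills the whole register."""
    return (1 << k) - 1 if taken else 0
```

For a 12-bit register, a first outcome of not taken therefore replaces the initial value 0b111111111111 with 0.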
Asc - Table Set-Associativity, Atm - Automaton, BHT - Branch
History Table, BTB - Branch Target Buffer Design, Config. -
Configuration, Entr. - Entries, GAg - Global Two-Level Adaptive
Branch Prediction Using a Global Pattern History Table, GSg -
Global Static Training Using a Preset Global Pattern History Table,
IBHT - Ideal Branch History Table, inf - Infinite, LT - Last-Time,
PAg - Per-address Two-Level Adaptive Branch Prediction
Using a Global Pattern History Table, PAp - Per-address Two-Level
Adaptive Branch Prediction Using Per-address Pattern History
Tables, PB - Preset Prediction Bit, PSg - Per-address Static
Training Using a Preset Global Pattern History Table, PHT - Pattern
History Table, sr - Shift Register.

Table 3: Configurations of simulated branch predictors. 

The pattern history bits in the pattern history table 
entries are also initialized at the beginning of execution. 
Since taken branches are more likely, for those pattern
history tables using automata A1, A2, A3, and A4, all
entries are initialized to state 3. For Last-Time, all entries
are initialized to state 1 such that the branches at
the beginning of execution will be more likely to be predicted
taken. It is not necessary to reinitialize pattern
history tables during execution.

In addition to the Two-Level Adaptive schemes, Lee
and A. Smith's Static Training schemes, Branch Target
Buffer designs, and some dynamic and static branch
prediction schemes were simulated for comparison purposes.
Lee and A. Smith's Static Training scheme is similar
in structure to the Per-address Two-Level Adaptive
scheme with an IBHT but with the important difference
that the prediction for a given pattern is pre-determined
by profiling. In this study, Lee and A. Smith's Static
Training is identified as PSg, meaning Per-address Static
Training using a preset global pattern history table.
Similarly, the scheme which has a similar structure to
GAg but with the difference that the second-level pattern
history information is collected from profiling is
abbreviated GSg, meaning Global Static Training using
a preset global pattern history table. Per-address Static
Training using per-address pattern history tables (PSp)
is another application of Static Training to a different
structure; however, this scheme requires a lot of storage
to keep track of the pattern behavior of all branches statically.
Therefore, no PSp schemes were simulated in this
study. Lee and A. Smith's Static Training schemes were
simulated with the same branch history table configurations
as used by the Two-Level Adaptive schemes for
a fair comparison. Static Training is no cheaper to implement
than the Two-Level Adaptive scheme, because the branch
history table and the pattern history table required by both
schemes are similar. In Static Training, before program
execution starts, extra time is needed to load the preset
pattern prediction bits into the pattern history table.

Branch Target Buffer designs were simulated with
automata A2 and Last-Time. The static branch prediction
schemes simulated include Always Taken,
Backward Taken and Forward Not Taken, and a profiling
scheme. The Always Taken scheme predicts taken for
all branches. The Backward Taken and Forward Not Taken
(BTFN) scheme predicts taken if a branch branches
backward and not taken if the branch branches forward.
The BTFN scheme is effective for loop-bound
programs, because it mispredicts only once in the execution
of a loop. The profiling scheme counts the frequency
of taken and not-taken for each static branch
in the profiling execution. The predicted direction of
a branch is the one the branch takes most frequently.
The profiling information of a program executed with a
training data set is used to predict branches when the
program is executed with the testing data sets, and the
prediction accuracy is calculated from those predictions.
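
The BTFN and profiling schemes are simple enough to sketch directly. The trace format (pairs of branch address and outcome), the tie-breaking rule, and the default for branches absent from the training run are assumptions made for this illustration.

```python
def btfn_predict(branch_addr: int, target_addr: int) -> bool:
    """Backward Taken, Forward Not Taken: predict taken for backward branches."""
    return target_addr < branch_addr


def build_profile(training_trace):
    """Map each static branch to its most frequent direction in the training run."""
    counts = {}
    for addr, taken in training_trace:
        tally = counts.setdefault(addr, [0, 0])   # [not-taken count, taken count]
        tally[int(taken)] += 1
    # Assumption: a tie is resolved in favour of taken.
    return {addr: tally[1] >= tally[0] for addr, tally in counts.items()}


def profile_accuracy(profile, testing_trace, default: bool = True) -> float:
    """Fraction of testing-trace branches predicted correctly by the profile."""
    hits = sum(profile.get(addr, default) == taken for addr, taken in testing_trace)
    return hits / len(testing_trace)
```

A profile built from one data set is then scored on a different data set, mirroring the training and testing split of Table 2.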

5 Branch Prediction Simulation Results 

Figures 5 through 11 show the prediction accuracy of
the branch predictors described in the previous section
on the nine SPEC benchmarks. "Tot GMean" is the geometric
mean across all the benchmarks, "Int GMean"
is the geometric mean across all the integer benchmarks,
and "FP GMean" is the geometric mean across all the
floating point benchmarks. The vertical axis shows the
prediction accuracy scaled from 76 percent to 100 percent.
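
For reference, the geometric mean used for the "GMean" summaries can be computed as follows. The function name is hypothetical and the accuracies in the usage note are made-up values, not results from the paper.

```python
import math


def geometric_mean(accuracies) -> float:
    """Geometric mean of positive values, computed through logarithms."""
    return math.exp(sum(math.log(x) for x in accuracies) / len(accuracies))
```

For example, `geometric_mean([0.90, 0.95, 0.99])` summarizes three hypothetical per-benchmark accuracies; the result is never larger than their arithmetic mean.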

5.1 Evaluation of the Parameters of Two-Level
Adaptive Branch Prediction

The three variations of Two-Level Adaptive Branch 
Prediction were simulated with different history regis- 
ter lengths to assess the effectiveness of increasing the 
recorded history length. The PAg and PAp schemes 
were each simulated with an ideal branch history ta- 
ble (IBHT) and with practical branch history tables to 
show the effect of the branch history table hit ratio. 

5.1.1 Effect of Pattern History Table Automa- 
ton 

Figure 5 shows the efficiency of using different finite-state
automata. Five automata, A1, A2, A3, A4, and
Last-Time, were simulated with a PAg branch predictor
having 12-bit history registers in a four-way set-associative
512-entry BHT. A1, A2, A3, and A4 all perform
better than Last-Time. The four-state automata
A1, A2, A3, and A4 maintain more history information
than Last-Time, which only records what happened the
last time; they are therefore more tolerant of deviations
in the execution history. Among the four-state
automata, A1 performs worse than the others. The performance
of A2, A3, and A4 is very close; however, A2
usually performs best. To keep the following figures
clear, each Two-Level Adaptive scheme is shown with
automaton A2.

Figure 5: Comparison of Two-Level Adaptive Branch
Predictors using different finite-state automata.



5.1.2 Effect of History Register Length 

Three variations using history registers of the 
same length 

Figure 6 shows the effects of history register length on
the prediction accuracy of Two-Level Adaptive schemes.
Every scheme in the graph was simulated with the same
history register length. Among the variations, PAp performs
the best, PAg the second, and GAg the worst.
GAg is not effective with 6-bit history registers, because
every branch updates the same history register, causing
excessive interference. PAg performs better than GAg,
because it has a branch history table which reduces the
interference in branch history. PAp predicts the best,
because the interference in the pattern history is removed.

Figure 6: Comparison of the Two-Level Adaptive
schemes using history registers of the same length.

Effects of various history register lengths 
To further investigate the effect of history register
length, Figure 7 shows the accuracy of GAg with various
history register lengths. There is an increase of 9
percent in accuracy by lengthening the history register
from 6 bits to 18 bits. The effect of history register
length is obvious on GAg schemes. The history register
length has a smaller effect on PAg schemes and an even
smaller effect on PAp schemes, because there is less
interference in the branch history and pattern history and
because these schemes are already effective with short
history registers.

Figure 7: Effect of various history register lengths on
GAg schemes.



5.1.3 Hardware Cost Efficiency of Three Vari- 
ations 

In Figure 6, the prediction accuracies of the schemes with
the same history register length were compared. However,
the various Two-Level Adaptive schemes have different
costs. PAp is the most expensive, PAg the second,
and GAg the least, as expected. When evaluating
the three variations of Two-Level Adaptive Branch
Prediction, it is useful to know which variation is the
least expensive when they predict with approximately
the same accuracy.

Figure 8 illustrates three schemes which achieve about 
97 percent prediction accuracy. One scheme is chosen 
for each variation to show the variation's configuration 
requirements to obtain that prediction accuracy. To 
achieve 97 percent prediction accuracy, GAg requires an 
18-bit history register, PAg requires 12-bit history regis- 
ters, and PAp requires 6-bit history registers. According 
to our cost estimates, PAg is the cheapest among these 
three. GAg's pattern history table is expensive when a 
long history register is used. PAp is expensive due to 
the required multiple pattern history tables. 

Figure 8: The Two-Level Adaptive schemes achieve
about 97 percent prediction accuracy.

5.1.4 Effect of Context Switch 

Since Two-Level Adaptive Branch Prediction uses the 
branch history table to keep track of branch history, the 
table needs to be flushed during a context switch. Fig- 
ure 9 shows the difference in the prediction accuracy 
for three schemes simulated with and without context 
switches. During the simulation, whenever a trap oc- 
curs in the instruction trace or every 500,000 instruc- 
tions if no trap occurs, a context switch is simulated. 
After a context switch, the pattern history table is not 
re-initialized, because the pattern history table of the 
saved process is more likely to be similar to the current 
process's pattern history table than to a re-initialized 
pattern history table. The value 500,000 is derived 
by assuming that a 50 MHz clock is used and context 
switches occur every 10 ms in a 1 IPC machine. The 
average accuracy degradations for the three schemes are
all less than 1 percent. The accuracy degradations for
gcc when PAg and PAp are used are much greater than
those of the other programs because of the large number
of traps in gcc. However, the excessive number of
traps does not degrade the prediction accuracy of the GAg
scheme, because an initialized global history register can
be refilled quickly. The prediction accuracy of fpppp
using GAg actually increases when context switches are
simulated. There are very few conditional branches in
fpppp and all the conditional branches have regular behavior;
therefore, initializing the global history register
helps clear out the noise.

Figure 9: Effect of context switch on prediction accuracy.
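
The context-switch policy described in this section can be sketched as follows. This is a minimal illustration rather than the simulator used for Figure 9: the `PAg` class, the ideal (unbounded) branch history table, the all-ones initial history, the 2-bit counters, and the trace record layout `(is_trap, addr, taken)` are all assumptions. The sketch reproduces the 500,000-instruction interval from the stated clock rate, switch period, and IPC, and flushes only the first-level history on a switch while leaving the pattern history table intact.

```python
# 50 MHz clock, a context switch every 10 ms, 1 instruction per cycle.
CLOCK_HZ = 50_000_000
SWITCH_PERIOD_MS = 10
IPC = 1
CONTEXT_SWITCH_INTERVAL = CLOCK_HZ * SWITCH_PERIOD_MS * IPC // 1000  # 500,000


class PAg:
    """Per-address history registers feeding one global pattern history table."""

    def __init__(self, k):
        self.mask = (1 << k) - 1
        self.bht = {}               # branch address -> k-bit history (ideal table)
        self.pht = [2] * (1 << k)   # 2-bit counters; initial value is an assumption

    def predict(self, addr):
        history = self.bht.get(addr, self.mask)  # all-ones on a miss (assumption)
        return self.pht[history] >= 2

    def update(self, addr, taken):
        history = self.bht.get(addr, self.mask)
        counter = self.pht[history]
        self.pht[history] = min(3, counter + 1) if taken else max(0, counter - 1)
        self.bht[addr] = ((history << 1) | int(taken)) & self.mask

    def context_switch(self):
        self.bht.clear()            # flush branch history; pattern table is kept


def run(trace, predictor, interval=CONTEXT_SWITCH_INTERVAL):
    """trace yields (is_trap, addr, taken); addr is None for non-branches."""
    hits = branches = since_switch = 0
    for is_trap, addr, taken in trace:
        since_switch += 1
        if addr is not None:
            branches += 1
            hits += predictor.predict(addr) == taken
            predictor.update(addr, taken)
        # Switch on every trap, or after `interval` instructions without one.
        if is_trap or since_switch >= interval:
            predictor.context_switch()
            since_switch = 0
    return hits / branches if branches else 0.0
```

Running the same trace once with the default interval and once with an interval larger than the trace length (and with traps ignored) gives the with and without context switch pair that the figure compares.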

5.1.5 Effect of Branch History Table Imple- 
mentation 

Figure 10 illustrates the effects of the size and associativity
of the branch history table in the presence of context
switches. Four practical branch history table implementations
and an ideal branch history table were simulated.
The four-way set-associative 512-entry branch
history table's performance is very close to that of the
ideal branch history table, because most branches in the 
programs can fit in the table. Prediction accuracy de- 
creases as table miss rate increases, which is also seen 
in the PAp schemes. 
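
A set-associative branch history table of the kind compared here can be sketched as below. This is an illustrative model only: the `SetAssocBHT` name, indexing by the branch address modulo the number of sets, LRU replacement, using the full address as the tag, and the all-ones initial history are assumptions, not details confirmed by this section.

```python
from collections import OrderedDict


class SetAssocBHT:
    """Set-associative branch history table with LRU replacement (assumed)."""

    def __init__(self, entries=512, ways=4, k=12):
        if entries % ways != 0:
            raise ValueError("entries must be a multiple of ways")
        self.num_sets = entries // ways
        self.ways = ways
        self.initial_history = (1 << k) - 1   # all ones on allocation (assumption)
        # Each set maps branch address -> history, ordered from LRU to MRU.
        self.sets = [OrderedDict() for _ in range(self.num_sets)]
        self.accesses = 0
        self.misses = 0

    def lookup(self, addr):
        """Return the history register for addr, allocating an entry on a miss."""
        entry_set = self.sets[addr % self.num_sets]
        self.accesses += 1
        if addr in entry_set:
            entry_set.move_to_end(addr)       # mark as most recently used
            return entry_set[addr]
        self.misses += 1
        if len(entry_set) >= self.ways:
            entry_set.popitem(last=False)     # evict the least recently used entry
        entry_set[addr] = self.initial_history
        return self.initial_history

    def store(self, addr, history):
        """Write back an updated history if the entry is still resident."""
        entry_set = self.sets[addr % self.num_sets]
        if addr in entry_set:
            entry_set[addr] = history

    def miss_rate(self):
        return self.misses / self.accesses if self.accesses else 0.0
```

Each miss hands the predictor a freshly initialized history register instead of the branch's real history, which is one way to see why accuracy falls as the miss rate rises.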

5.2 Comparison of Two-Level Adaptive Branch
Prediction and Other Prediction Schemes

Figure 11 compares the branch prediction schemes. The 
PAg scheme which achieves 97 percent prediction ac- 
curacy is chosen for comparison with other well-known 
schemes, because it costs the least among the three vari- 
ations of Two-Level Adaptive Branch Prediction. 

The 4-way set-associative 512-entry BHT is selected 
to be used by all schemes which keep the first-level
branch history information, because it is simple enough 
to be implemented. The Two-Level Adaptive scheme 
and the Static Training scheme were chosen on the ba- 
sis of similar costs. 

The top curve is achieved by the Two-Level Adaptive 
scheme whose prediction accuracy is about 97 percent. 

Figure 10: Effect of branch history table implementation
on PAg schemes.



Since the data for the Static Training schemes are
incomplete due to the unavailability of appropriate data
sets, the data points for eqntott, fpppp, matrix300, and
tomcatv are not graphed. For the benchmarks that are
available, PSg is about 1 to 4 percent lower than the top
curve and GSg is about 4 to 19 percent lower, with average
prediction accuracies of 94.4 percent and 89 percent,
respectively. Note that their accuracy depends greatly
on the similarities between the data sets used for train- 
ing and testing. The prediction accuracy for the branch 
target buffer using 2-bit saturating up-down counters 
[17] is around 93 percent. The Profiling scheme achieves 
about 91 percent prediction accuracy. The branch tar- 
get buffer using Last-Time achieves about 89 percent 
prediction accuracy. Most of the prediction accuracy 
curves of BTFN and Always Taken are below the base 
line (76 percent). BTFN's average prediction accuracy 
is about 68.5 percent and Always Taken's is about 62.5 
percent. In this figure, the Two-Level Adaptive scheme 
is superior to the other schemes by at least 2.6 percent.

Figure 11: Comparison of branch prediction schemes.
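
For reference, the simpler schemes in this comparison can be sketched in a few lines each. This is a toy model, not the simulated configurations: the unbounded per-address tables (in place of a branch target buffer), the default predictions for branches not yet seen, and the trace record layout `(addr, target, taken)` are assumptions, and the Profiling and Static Training schemes are omitted.

```python
def accuracy(predict, update, trace):
    """Fraction of branches in trace for which predict matches the outcome."""
    hits = 0
    for addr, target, taken in trace:
        hits += predict(addr, target) == taken
        update(addr, taken)
    return hits / len(trace)


def always_taken(trace):
    return accuracy(lambda addr, target: True, lambda addr, taken: None, trace)


def btfn(trace):
    # Backward Taken, Forward Not taken: predict taken when the target precedes the branch.
    return accuracy(lambda addr, target: target < addr, lambda addr, taken: None, trace)


def last_time(trace):
    last = {}  # branch address -> most recent outcome; unseen branches default to taken

    def update(addr, taken):
        last[addr] = taken

    return accuracy(lambda addr, target: last.get(addr, True), update, trace)


def two_bit_counter(trace):
    counters = {}  # branch address -> 2-bit saturating up-down counter, starting at 2

    def update(addr, taken):
        c = counters.get(addr, 2)
        counters[addr] = min(3, c + 1) if taken else max(0, c - 1)

    return accuracy(lambda addr, target: counters.get(addr, 2) >= 2, update, trace)


# Toy trace: a backward loop branch taken nine times and then not taken, ten times over.
loop = [(100, 80, True)] * 9 + [(100, 80, False)]
trace = loop * 10
for scheme in (always_taken, btfn, last_time, two_bit_counter):
    print(scheme.__name__, scheme(trace))
```

On a loop-closing branch like this one, Last-Time mispredicts twice per loop execution (at the exit and again on re-entry), whereas the 2-bit counter mispredicts only at the exit. The toy trace says nothing about the benchmark accuracies reported above.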

6 Concluding Remarks

In this paper we have proposed a new dynamic branch 
predictor (Two-Level Adaptive Branch Prediction) that 
achieves substantially higher accuracy than any other 
scheme that we are aware of. We computed the hard- 
ware costs of implementing three variations of this 
scheme and determined that the most effective imple- 
mentation of Two-Level Adaptive Branch Prediction 
utilizes a per-address branch history table and a global 
pattern history table. 

We have measured the prediction accuracy of the 
three variations of Two-Level Adaptive Branch Prediction
and several other popular proposed dynamic
and static prediction schemes using trace-driven sim- 
ulation of nine of the ten SPEC benchmarks. We have 
shown that the average prediction accuracy for Two- 
Level Adaptive Branch Prediction is about 97 percent, 
while the other known schemes achieve at most 94.4 
percent average prediction accuracy. 

We have measured the effects of varying the param- 
eters of the Two-Level Adaptive predictors. We noted 
the sensitivity to k, the length of the history register, 
and s, the size of each entry in the pattern history ta- 
ble. We reported on the effectiveness of the various 
prediction algorithms that use the pattern history table 
information. We showed the effects of context switch- 
ing. 

Finally, we should point out that we feel our 97 per- 
cent prediction accuracy figures are not good enough 
and that future research in branch prediction is still 
needed. High performance computing engines in the 
future will increase the issue rate and the depth of 
the pipeline, which will combine to increase further the 
amount of speculative work that will have to be thrown 
out due to a branch prediction miss. Thus, the 3 per- 
cent prediction miss rate needs improvement. We are 
examining that 3 percent to try to characterize it and 
hopefully reduce it. 